Refactor HTMLReader to improve encoding detection logic and HTMLWriter to AutoClosable #1419

Open: miurahr wants to merge 10 commits into master from topic/miurahr/filters2/html/refactor-reader-creation

miurahr (Member) commented May 23, 2025

This PR refactors the HTMLReader class to improve readability and modularity and to make encoding detection more robust. Key improvements include replacing inline encoding detection with helper methods and migrating from String to the Charset type for encoding handling.

Pull request type

  • refactor

Which ticket is resolved?

There is no related issue.

What does this PR change?

  • Refactored the createReader method and extracted detectBOM and detectEncodingFromContent methods for better modularity.
  • Replaced the encoding type from String to Charset to enhance type safety and consistency.
  • Consolidated repeated charset detection logic into the new detectCharset helper method.
  • Streamlined file encoding handling by adding better fallback mechanisms.
  • Removed redundant and unused code, improving maintainability and reducing complexity.
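The BOM-first detection described above can be sketched as follows. This is a minimal illustration, not the PR's code: the class name `BomSniffer`, the method signatures, and the choice to return `null` when no BOM is found are all assumptions; only the names `detectBOM` and `detectCharset` come from the description.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper illustrating BOM sniffing; not taken from the PR. */
final class BomSniffer {
    private BomSniffer() {}

    /**
     * Returns the charset implied by a leading byte order mark,
     * or null when the prefix carries no recognized BOM.
     * UTF-32 BOMs are deliberately not handled in this sketch.
     */
    static Charset detectBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return StandardCharsets.UTF_16BE;
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return StandardCharsets.UTF_16LE;
            }
        }
        return null;
    }

    /** Tries the BOM first, then falls back to the caller-supplied default. */
    static Charset detectCharset(byte[] head, Charset fallback) {
        Charset fromBom = detectBom(head);
        return fromBom != null ? fromBom : fallback;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'p', '>'};
        byte[] plain = "<p>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectCharset(utf8, StandardCharsets.ISO_8859_1));
        System.out.println(detectCharset(plain, StandardCharsets.ISO_8859_1));
    }
}
```

Returning a `Charset` rather than a `String` means an unsupported encoding name cannot leak past the detection step, which is the type-safety benefit the description refers to.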

Other information

This comment was marked as outdated.

miurahr added 2 commits May 23, 2025 19:25
- Separated BOM detection, content-based encoding detection, and charset parsing into distinct methods for clarity and reusability.
- Replaced redundant code with a streamlined structure, enhancing maintainability and readability.

Signed-off-by: Hiroshi Miura <miurahr@linux.com>
(cherry picked from commit 28daf3c)
Signed-off-by: Hiroshi Miura <miurahr@linux.com>
Introduced AutoCloseable implementation to streamline resource handling and added @Override annotations for better code clarity. Refactored fields to final, ensuring immutability and improving thread safety, while simplifying the writer functionality.

Signed-off-by: Hiroshi Miura <miurahr@linux.com>
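For readers unfamiliar with the pattern, here is a minimal sketch of what an AutoCloseable writer with final fields enables. The class `BufferedSink` is a hypothetical stand-in and does not reproduce HTMLWriter; it only shows how try-with-resources guarantees the final flush and close.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical buffering writer; illustrates AutoCloseable, not the PR's HTMLWriter. */
final class BufferedSink implements AutoCloseable {
    private final Writer out;                                  // assigned once, never reassigned
    private final StringBuilder buffer = new StringBuilder();
    private boolean closed;

    BufferedSink(Writer out) {
        this.out = out;
    }

    void write(String text) {
        if (closed) {
            throw new IllegalStateException("sink is closed");
        }
        buffer.append(text);
    }

    /** Flushes buffered text to the underlying writer and closes it; safe to call twice. */
    @Override
    public void close() throws IOException {
        if (closed) {
            return;
        }
        closed = true;
        try {
            out.write(buffer.toString());
        } finally {
            out.close();
        }
    }

    /** Usage example: the sink is closed automatically when the try block exits. */
    static String demo() {
        StringWriter target = new StringWriter();
        try (BufferedSink sink = new BufferedSink(target)) {
            sink.write("<html>");
            sink.write("</html>");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With try-with-resources, callers no longer need an explicit `finally` block to close the writer, and the buffered content is written even when the body throws.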
@miurahr miurahr force-pushed the topic/miurahr/filters2/html/refactor-reader-creation branch from 22a0cd7 to 2694720 Compare May 23, 2025 10:36
@miurahr miurahr changed the title from "Refactor HTMLReader to improve encoding detection logic" to "Refactor HTMLReader to improve encoding detection logic and HTMLWriter to AutoClosable" May 23, 2025

miurahr added 2 commits May 23, 2025 22:29
Introduced AbstractHtmlReader as a base class for HTMLReader to simplify encoding and file handling. Moved encoding detection logic to HTMLUtils for reusability and better separation of concerns. This enhances code modularity and reduces duplication.

Signed-off-by: Hiroshi Miura <miurahr@linux.com>
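As an illustration of the kind of content-based detection that could live in a shared utility, the sketch below scans the head of a document for a `<meta>` charset declaration. The class name `MetaCharsetSniffer`, the regular expression, and the `null` return for "not found" are assumptions made for this example; the actual logic in HTMLUtils may differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical utility; shows one way to read a charset from HTML content. */
final class MetaCharsetSniffer {
    // Covers both <meta charset="..."> and the http-equiv form with "charset=..." in content.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([\\w.:+-]+)",
            Pattern.CASE_INSENSITIVE);

    private MetaCharsetSniffer() {}

    /**
     * Returns the declared charset, or null when none is declared or the
     * declared name is not supported by this JVM. Only works for
     * ASCII-compatible encodings, since the bytes are scanned as Latin-1.
     */
    static Charset detectFromContent(byte[] head) {
        // ISO-8859-1 maps every byte to a char, so this decode cannot fail.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] html = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectFromContent(html));
    }
}
```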
Enhanced error messages for test assertions to improve clarity when tests fail. This includes specifying encoding expectations and pinpointing differences in file size or content during comparisons.

Signed-off-by: Hiroshi Miura <miurahr@linux.com>
@miurahr miurahr force-pushed the topic/miurahr/filters2/html/refactor-reader-creation branch from 69fc264 to 389f0ec Compare May 24, 2025 02:46

Replaced the hardcoded "UTF-8" encoding in test assertions with `Charset.defaultCharset()` to align with system default behavior. This ensures better platform compatibility and correct encoding detection verification.

Signed-off-by: Hiroshi Miura <miurahr@linux.com>
@miurahr miurahr force-pushed the topic/miurahr/filters2/html/refactor-reader-creation branch from 389f0ec to ef0bb61 Compare May 24, 2025 03:40
miurahr added 3 commits May 24, 2025 13:51
- Reformatted code for better readability, using consistent indentation and line wrapping.
- Enhanced encoding detection by introducing fallback logic and replacing String-based default encoding with a Charset-based approach.
- Added JetBrains annotations (@NotNull, @Nullable) to improve clarity and null safety.

Signed-off-by: Hiroshi Miura <miurahr@linux.com>
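A Charset-based default with fallback might look like the sketch below. The helper name `CharsetDefaults.orDefault` is hypothetical and the PR's actual fallback order is not shown in this conversation; the example only demonstrates converting a possibly missing or invalid encoding name into a guaranteed non-null `Charset`.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

/** Hypothetical helper; turns an optional encoding name into a non-null Charset. */
final class CharsetDefaults {
    private CharsetDefaults() {}

    /** Returns the named charset, or the fallback when the name is missing or unusable. */
    static Charset orDefault(String name, Charset fallback) {
        if (name == null || name.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Syntactically invalid or not installed on this JVM: use the default.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(orDefault("utf-16le", StandardCharsets.UTF_8));
        System.out.println(orDefault(null, Charset.defaultCharset()));
    }
}
```

Resolving the name once at the boundary keeps the rest of the reader free of string comparisons and repeated `Charset.forName` calls.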
Signed-off-by: Hiroshi Miura <miurahr@linux.com>
…ng assertion

Signed-off-by: Hiroshi Miura <miurahr@linux.com>
@miurahr miurahr marked this pull request as ready for review June 2, 2025 02:17
miurahr added 2 commits June 2, 2025 14:03
Signed-off-by: Hiroshi Miura <miurahr@linux.com>
Added tests to verify correct encoding detection for files with UTF-8, UTF-16 BE/LE BOM, and files without BOM. Introduced new test files to support encoding validation. Improved comments in existing tests for better readability.

Signed-off-by: Hiroshi Miura <miurahr@linux.com>

github-actions bot commented Jun 2, 2025

❌ Quality checks failed.

Please see the Gradle Scan page for details:
https://gradle.com/s/ccwjuicvlhinc
